OCPBUGS-85058: monitortests: allow etcd CO blips during TNF jobs on two-node upgrades#31138
OCPBUGS-85058: monitortests: allow etcd CO blips during TNF jobs on two-node upgrades#31138eggfoobar wants to merge 3 commits intoopenshift:mainfrom
Conversation
Teach legacy CVO monitor tests to treat cluster-etcd-operator condition reasons shaped as tnf-*_JobRunning as expected on DualReplica and HighlyAvailableArbiter topologies: Available=False during upgrade, and Progressing=True while machine-config is progressing. Add isTNFJobClusterOperatorReason helper and unit tests so we only match CEO job-running surfaces, not unrelated etcd failures. Co-authored-by: Cursor Composer <noreply@cursor.com> Signed-off-by: ehila <ehila@redhat.com>
|
Pipeline controller notification For optional jobs, comment This repository is configured in: automatic mode |
|
@eggfoobar: This pull request references Jira Issue OCPBUGS-85058, which is valid. The bug has been moved to the POST state. 3 validation(s) were run on this bug
The bug has been updated to refer to the pull request using the external bug tracker. DetailsIn response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository. |
|
No actionable comments were generated in the recent review. 🎉 ℹ️ Recent review info⚙️ Run configurationConfiguration used: Repository YAML (base), Central YAML (inherited) Review profile: CHILL Plan: Enterprise Run ID: 📒 Files selected for processing (1)
🚧 Files skipped from review as they are similar to previous changes (1)
WalkthroughReplaces the upgrade operator state-transition call with a topology-aware progressing-state path (passes patch-level flag and admin REST config), adds TNF-shaped JobRunning reason detection and two-node topology exceptions for operator conditions, and adds unit tests for the TNF-reason helper. Changes
🎯 4 (Complex) | ⏱️ ~45 minutes 🚥 Pre-merge checks | ✅ 11 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (11 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Tip 💬 Introducing Slack Agent: The best way for teams to turn conversations into code.Slack Agent is built on CodeRabbit's deep understanding of your code, so your team can collaborate across the entire SDLC without losing context.
Built for teams:
One agent for your entire SDLC. Right inside Slack. Comment |
|
/payload-job periodic-ci-openshift-release-main-nightly-4.22-e2e-metal-ovn-two-node-fencing-upgrade periodic-ci-openshift-release-main-nightly-5.0-e2e-metal-ovn-two-node-fencing-upgrade |
|
@eggfoobar: trigger 2 job(s) for the /payload-(with-prs|job|aggregate|job-with-prs|aggregate-with-prs) command
See details on https://pr-payload-tests.ci.openshift.org/runs/ci/f7d036b0-4982-11f1-8f62-95a62a82cf83-0 |
|
[APPROVALNOTIFIER] This PR is NOT APPROVED This pull-request has been approved by: eggfoobar The full list of commands accepted by this bot can be found here. DetailsNeeds approval from an approver in each of these files:Approvers can indicate their approval by writing |
|
@eggfoobar: This pull request references Jira Issue OCPBUGS-85058, which is valid. 3 validation(s) were run on this bug
DetailsIn response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository. |
There was a problem hiding this comment.
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
pkg/monitortests/clusterversionoperator/legacycvomonitortests/monitortest.go (1)
95-99:⚠️ Potential issue | 🟠 Major | ⚡ Quick winDon’t abort the whole monitor evaluation when upgrade-level lookup fails.
This new branch turns a transient
ClusterVersionread failure intoEvaluateTestsFromConstructedIntervalsreturning an error, which drops the JUnit cases already built above. Since this value only decides whether the progressing-state suite should be skipped for patch upgrades, please fall back to a conservative skip instead of failing the entire legacy monitor run.Suggested change
- level, err := getUpgradeLevel(w.adminRESTConfig) - if err != nil || level == unknownUpgradeLevel { - return nil, fmt.Errorf("failed to determine upgrade level: %w", err) - } - junits = append(junits, testUpgradeOperatorProgressingStateTransitions(finalIntervals, level == patchUpgradeLevel, w.adminRESTConfig)...) + isPatchLevelUpgrade := true // default to skipping this suite if the level cannot be determined + if level, err := getUpgradeLevel(w.adminRESTConfig); err == nil && level != unknownUpgradeLevel { + isPatchLevelUpgrade = level == patchUpgradeLevel + } + junits = append(junits, testUpgradeOperatorProgressingStateTransitions(finalIntervals, isPatchLevelUpgrade, w.adminRESTConfig)...)🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@pkg/monitortests/clusterversionoperator/legacycvomonitortests/monitortest.go` around lines 95 - 99, The code currently returns an error when getUpgradeLevel fails which aborts EvaluateTestsFromConstructedIntervals and drops previously built JUnit cases; instead, swallow the transient failure and conservatively skip the progressing-state suite: call getUpgradeLevel(w.adminRESTConfig), and if it returns an error or unknownUpgradeLevel, do not return—set isPatch := false (or treat as non-patch); otherwise set isPatch := (level == patchUpgradeLevel); then call testUpgradeOperatorProgressingStateTransitions(finalIntervals, isPatch, w.adminRESTConfig) and append its results to junits; ensure no early return on getUpgradeLevel failure so previously accumulated junits are preserved.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Outside diff comments:
In
`@pkg/monitortests/clusterversionoperator/legacycvomonitortests/monitortest.go`:
- Around line 95-99: The code currently returns an error when getUpgradeLevel
fails which aborts EvaluateTestsFromConstructedIntervals and drops previously
built JUnit cases; instead, swallow the transient failure and conservatively
skip the progressing-state suite: call getUpgradeLevel(w.adminRESTConfig), and
if it returns an error or unknownUpgradeLevel, do not return—set isPatch :=
false (or treat as non-patch); otherwise set isPatch := (level ==
patchUpgradeLevel); then call
testUpgradeOperatorProgressingStateTransitions(finalIntervals, isPatch,
w.adminRESTConfig) and append its results to junits; ensure no early return on
getUpgradeLevel failure so previously accumulated junits are preserved.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Repository YAML (base), Central YAML (inherited)
Review profile: CHILL
Plan: Enterprise
Run ID: c4c2d66f-fa68-4d8c-ad75-24b48e3774e1
📒 Files selected for processing (3)
pkg/monitortests/clusterversionoperator/legacycvomonitortests/monitortest.gopkg/monitortests/clusterversionoperator/legacycvomonitortests/operators.gopkg/monitortests/clusterversionoperator/legacycvomonitortests/operators_test.go
|
Scheduling required tests: |
|
/payload-job periodic-ci-openshift-release-main-nightly-4.22-e2e-metal-ovn-two-node-fencing-upgrade periodic-ci-openshift-release-main-nightly-5.0-e2e-metal-ovn-two-node-fencing-upgrade |
|
@eggfoobar: trigger 2 job(s) for the /payload-(with-prs|job|aggregate|job-with-prs|aggregate-with-prs) command
See details on https://pr-payload-tests.ci.openshift.org/runs/ci/9a2372a0-49a2-11f1-8086-2c655c08170e-0 |
…low-ups Allow etcd Progressing while MCO runs when CEO reports EtcdMembers_MembersNotStarted (member still joining during fencing/replacement), in addition to existing tnf-*_JobRunning. On dual-replica, tolerate openshift-apiserver Available blips for APIServices_PreconditionNotReady during upgrade, and MCO-time Progressing for OperatorConfig_NewGeneration (openshift-apiserver) and APIServerDeployment_NewGeneration (authentication) as operators roll the control plane alongside machine-config. Co-authored-by: Cursor Composer <noreply@cursor.com> Signed-off-by: ehila <ehila@redhat.com>
|
/payload-job periodic-ci-openshift-release-main-nightly-4.22-e2e-metal-ovn-two-node-fencing-upgrade periodic-ci-openshift-release-main-nightly-5.0-e2e-metal-ovn-two-node-fencing-upgrade |
1 similar comment
|
/payload-job periodic-ci-openshift-release-main-nightly-4.22-e2e-metal-ovn-two-node-fencing-upgrade periodic-ci-openshift-release-main-nightly-5.0-e2e-metal-ovn-two-node-fencing-upgrade |
|
Scheduling required tests: |
|
@eggfoobar: trigger 2 job(s) for the /payload-(with-prs|job|aggregate|job-with-prs|aggregate-with-prs) command
See details on https://pr-payload-tests.ci.openshift.org/runs/ci/ccb215f0-49d9-11f1-8175-d81295c02a18-0 |
|
@eggfoobar: trigger 2 job(s) for the /payload-(with-prs|job|aggregate|job-with-prs|aggregate-with-prs) command
See details on https://pr-payload-tests.ci.openshift.org/runs/ci/84b75970-49db-11f1-8000-23a6f1e43003-0 |
|
Job Failure Risk Analysis for sha: 1486616
|
|
/payload-job periodic-ci-openshift-release-main-nightly-4.22-e2e-metal-ovn-two-node-fencing-upgrade periodic-ci-openshift-release-main-nightly-5.0-e2e-metal-ovn-two-node-fencing-upgrade |
|
@eggfoobar: trigger 2 job(s) for the /payload-(with-prs|job|aggregate|job-with-prs|aggregate-with-prs) command
See details on https://pr-payload-tests.ci.openshift.org/runs/ci/bf374780-4a31-11f1-811e-4007cf2c7659-0 |
|
/payload-job periodic-ci-openshift-release-main-nightly-4.22-e2e-metal-ovn-two-node-fencing-upgrade periodic-ci-openshift-release-main-nightly-5.0-e2e-metal-ovn-two-node-fencing-upgrade |
|
@eggfoobar: trigger 2 job(s) for the /payload-(with-prs|job|aggregate|job-with-prs|aggregate-with-prs) command
See details on https://pr-payload-tests.ci.openshift.org/runs/ci/b5594220-4a79-11f1-8280-9f028898ce3f-0 |
|
/payload-job periodic-ci-openshift-release-main-nightly-4.22-e2e-metal-ovn-two-node-fencing-upgrade periodic-ci-openshift-release-main-nightly-5.0-e2e-metal-ovn-two-node-fencing-upgrade |
|
@eggfoobar: trigger 2 job(s) for the /payload-(with-prs|job|aggregate|job-with-prs|aggregate-with-prs) command
See details on https://pr-payload-tests.ci.openshift.org/runs/ci/2ca3c900-4b0f-11f1-98da-223abf407564-0 |
…xceptions Allow clusteroperator/etcd Progressing=True with reason NodeInstaller while MCO is progressing on DualReplica (static pod revision rollout overlapping MCO). During upgrade on DualReplica, tolerate openshift-samples Available=False with SampleUpsertsPending and Degraded=True with APIServerServiceUnavailableError when template writes hit transient apiserver errors. Co-authored-by: Cursor Composer <noreply@cursor.com> Signed-off-by: ehila <ehila@redhat.com>
|
/payload-job periodic-ci-openshift-release-main-nightly-4.22-e2e-metal-ovn-two-node-fencing-upgrade periodic-ci-openshift-release-main-nightly-5.0-e2e-metal-ovn-two-node-fencing-upgrade |
|
@eggfoobar: trigger 2 job(s) for the /payload-(with-prs|job|aggregate|job-with-prs|aggregate-with-prs) command
See details on https://pr-payload-tests.ci.openshift.org/runs/ci/4ec3f670-4b48-11f1-8a29-a9bcfcee977c-0 |
|
Scheduling required tests: |
|
@eggfoobar: The following test failed, say
Full PR test history. Your PR dashboard. DetailsInstructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here. |
Teach legacy CVO monitor tests to treat cluster-etcd-operator condition reasons shaped as tnf-*_JobRunning as expected on DualReplica and HighlyAvailableArbiter topologies: Available=False during upgrade, and Progressing=True while machine-config is progressing.
Add isTNFJobClusterOperatorReason helper and unit tests so we only match CEO job-running surfaces, not unrelated etcd failures.
Summary by CodeRabbit
Bug Fixes
Tests